Abstract

In several online prediction problems of recent interest the comparison class is composed of
matrices with bounded entries. For example, in the online max-cut problem, the comparison
class is matrices which represent cuts of a given graph and in online gambling the comparison
class is matrices which represent permutations over $n$ teams. Another important example is
online collaborative filtering in which a widely used comparison class is the set of matrices
with a small trace norm. In this paper we isolate a property of matrices, which we call
$(\beta,\tau)$-decomposability, and derive an efficient online learning algorithm that enjoys a
regret bound of $\tilde{O}(\sqrt{\beta\tau T})$ for all problems in which the comparison class is
composed of $(\beta,\tau)$-decomposable matrices. By analyzing the decomposability of cut matrices,
triangular matrices, and low trace-norm matrices, we derive near optimal regret bounds for online
max-cut, online gambling, and online collaborative filtering. In particular, this resolves (in the
affirmative) an open problem posed by Abernethy [2010] and Kleinberg et al. [2010]. Finally, we
derive lower bounds for the three problems and show that our upper bounds are optimal up to
logarithmic factors. In particular, our lower bound for the online collaborative filtering problem
resolves another open problem posed by Shamir and Srebro [2011].



1 Introduction 



We consider online learning problems in which on each round the learner receives
$(i_t, j_t) \in [m] \times [n]$ and should return a prediction in $[-1,1]$. For example, in the
online collaborative filtering problem, $m$ is the number of users, $n$ is the number of items
(e.g., movies), and on each online round the learner should predict a number in $[-1,1]$
indicating how much user $i_t \in [m]$ likes item $j_t \in [n]$. Once the learner makes the
prediction, the environment responds with a loss function, $\ell_t : [-1,1] \to \mathbb{R}$,
that assesses the correctness of the learner's prediction.

A natural approach for the learner is to maintain a matrix $W_t \in [-1,1]^{m \times n}$, and to
predict the corresponding entry, $W_t(i_t, j_t)$. The matrix is updated based on the loss function
and the process continues.

Without further structure, the above setting is equivalent to $mn$ independent prediction
problems, one per user-item pair. However, it is usually assumed that there is a relationship
between the different matrix entries, e.g. similar users prefer similar movies. This can be
modeled in the online learning setting by assuming that there is some fixed matrix $W$, in a
restricted class of matrices $\mathcal{W} \subseteq [-1,1]^{m \times n}$, such that the strategy
which always predicts $W(i_t, j_t)$ has a small cumulative loss. A common choice for $\mathcal{W}$
in the collaborative filtering application is the set of matrices with a trace norm of at most
$\tau$ (which intuitively requires the prediction matrix to be of low rank). As usual, rather than
assuming that some $W \in \mathcal{W}$ has a small cumulative loss, we require that the regret of
the online learner with respect to $\mathcal{W}$ will be small. Formally, after $T$ rounds, the
regret of the learner is

$$ \mathrm{Regret} := \sum_{t=1}^{T} \ell_t(W_t(i_t, j_t)) - \min_{W \in \mathcal{W}} \sum_{t=1}^{T} \ell_t(W(i_t, j_t)), $$

and we would like the regret to be as small as possible.

A natural question is what properties of $\mathcal{W}$ enable us to derive an efficient online
learning algorithm that enjoys low regret, and how the regret depends on the properties of
$\mathcal{W}$. In this paper we define a property of matrices, called $(\beta,\tau)$-decomposability,
and derive an efficient online learning algorithm that enjoys a regret bound of
$\tilde{O}(\sqrt{\beta\tau T})$ for any problem in which $\mathcal{W} \subseteq [-1,1]^{m \times n}$
and every matrix $W \in \mathcal{W}$ is $(\beta,\tau)$-decomposable. Roughly speaking, $W$ is
$(\beta,\tau)$-decomposable if a symmetrization of it can be written as $P - N$ where both $P$ and
$N$ are positive semidefinite, have sum of traces bounded by $\tau$, and have diagonal elements
bounded by $\beta$.

We apply this technique to three online learning problems. 

1. Online max-cut: On each round, the learner receives a pair of graph nodes
$(i, j) \in [n] \times [n]$, and should decide whether there is an edge connecting $i$ and $j$.
Then, it receives a binary feedback. The comparison class is the set of all cuts of the graph,
which can be encoded as the set of matrices $\{W_A : A \subseteq [n]\}$, where $W_A(i, j)$
indicates if $(i, j)$ crosses the cut defined by $A$ or not. It is possible to achieve a regret of
$O(\sqrt{nT})$ for this problem by a non-efficient algorithm (simply refer to each $A$ as an
expert and apply a prediction with expert advice algorithm). Our algorithm yields a nearly
optimal regret bound of $O(\sqrt{n \log(n) T})$ for this problem. This is the first efficient
algorithm that achieves near optimal regret.

2. Online Gambling: On each round, the learner receives a pair of teams
$(i, j) \in [n] \times [n]$, and should predict whether $i$ is going to beat $j$ in an upcoming
matchup or vice versa. The comparison class is the set of permutations over the teams, where a
permutation will predict that $i$ is going to beat $j$ if $i$ appears before $j$ in the
permutation. Permutations can be encoded naturally as matrices, where $W(i,j)$ is either 1 (if
$i$ appears before $j$ in the permutation) or 0. Again, it is possible to achieve a regret of
$O(\sqrt{n \log(n) T})$ by a non-efficient algorithm (that simply treats each permutation as an
expert). Our algorithm yields a nearly optimal regret bound of $O(\sqrt{n \log^3(n) T})$. This
resolves an open problem posed in Abernethy [2010] and Kleinberg et al. [2010]. Achieving this
kind of regret bound was widely considered intractable, since computing the best permutation in
hindsight is exactly the NP-hard minimum feedback arc set problem. In fact, Kanade and Steinke
[2012] tried to show computational hardness for this problem by reducing the problem of online
agnostic learning of halfspaces in a restricted setting to it. This paper shows that the problem
is in fact tractable.

3. Online Collaborative Filtering: We already mentioned this problem previously. We consider
the comparison class $\mathcal{W} = \{W \in [-1,1]^{m \times n} : \|W\|_* \le \tau\}$, where
$\|\cdot\|_*$ is the trace norm. Without loss of generality assume $m \le n$. Our algorithm yields
a nearly optimal regret bound of $O(\sqrt{\tau \sqrt{n} \log(n) T})$. Since for this problem one
typically has $\tau = \Theta(n)$, we can rewrite the regret bound as
$O(\sqrt{n^{3/2} \log(n) T})$. In contrast, a direct application of the online mirror descent
framework to this problem yields a regret of $O(\sqrt{\tau^2 T}) = O(\sqrt{n^2 T})$. The latter is
a trivial bound since the bound becomes meaningful only after $T \ge n^2$ rounds (which means
that we saw the entire matrix).

Recently, Cesa-Bianchi and Shamir [2011] proposed a rather different algorithm with regret
bounded by $O(\tau \sqrt{n})$ but under the additional assumption that each entry $(i, j)$ is seen
only once. In addition, while the runtimes of both our method and the Cesa-Bianchi and Shamir
[2011] method are polynomial, the runtime of our method is significantly smaller: for
$m \approx n$, each iteration of our method can be implemented in $O(n^3)$ time (see Section 6),
whereas the runtime of each iteration in their algorithm is at least $\Omega(n^4)$ and can be
significantly larger depending on the specific implementation.¹

Finally, we derive (nearly) matching lower bounds for the three problems. In particular, our
lower bound for the online collaborative filtering problem implies that the sample complexity of
learning matrices with bounded entries and trace norm of $\Theta(n)$ is $\Omega(n^{3/2})$. This
matches an upper bound on the sample complexity derived by Shamir and Shalev-Shwartz [2011] and
solves an open problem posed by Shamir and Srebro [2011].



2 Problem statements and main results 

We start with the definition of $(\beta,\tau)$-decomposability. For this, we first define a
symmetrization operator.

Definition 1 (Symmetrization). Given an $m \times n$ non-symmetric matrix $W$ its symmetrization
is the $(m + n) \times (m + n)$ matrix:

$$ \mathrm{sym}(W) := \begin{pmatrix} 0 & W \\ W^\top & 0 \end{pmatrix}. $$

If $m = n$ and $W$ is symmetric, then $\mathrm{sym}(W) := W$.
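As a concrete illustration of Definition 1, the following minimal Python sketch builds the block form of sym(W) for a matrix stored as a list of rows. The function name `sym` and the list-of-lists representation are assumptions made here for illustration; the sketch always uses the block form and does not apply the shortcut sym(W) := W for symmetric W.

```python
def sym(W):
    """Return the (m+n) x (m+n) symmetrization [[0, W], [W^T, 0]] of an m x n matrix W."""
    m, n = len(W), len(W[0])
    p = m + n
    S = [[0] * p for _ in range(p)]
    for i in range(m):
        for j in range(n):
            # Entry (i, j) of W lands at (i, j + m) and, by symmetry, at (j + m, i).
            S[i][j + m] = W[i][j]
            S[j + m][i] = W[i][j]
    return S


# A 2 x 3 example with entries in [-1, 1] (0-based indices).
W = [[1, -1, 0.5],
     [0, 1, -1]]
S = sym(W)
```

With 0-based indices, entry (i, j) of W is read back from S[i][j + m], which is the index shift (i, j + q) with q = m used later in the reduction.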

The main property of matrices we rely on is $(\beta,\tau)$-decomposability, which we define below.

Definition 2 ($(\beta,\tau)$-decomposability). An $m \times n$ matrix $W$ is
$(\beta,\tau)$-decomposable if there exist symmetric, positive semidefinite matrices
$P, N \in \mathbb{R}^{p \times p}$, where $p$ is the order of $\mathrm{sym}(W)$, such that the
following conditions hold:

$$ \mathrm{sym}(W) = P - N, $$
$$ \mathrm{Tr}(P) + \mathrm{Tr}(N) \le \tau, $$
$$ \forall i \in [p] : \; P(i,i),\, N(i,i) \le \beta. $$

We say that a set of matrices $\mathcal{W}$ is $(\beta,\tau)$-decomposable if every matrix in
$\mathcal{W}$ is $(\beta,\tau)$-decomposable.



¹ Specifically, each iteration in their algorithm requires solving $n$ empirical risk minimization
problems over the hypothesis space of $m \times n$ matrices with a bounded trace norm (in their
notation, to obtain the optimal bound, one should set $T = n^2$ and $\eta \ge 1/n$, and then
should solve $\eta T$ empirical risk minimization problems per iteration). It is not clear what is
the optimal runtime of solving each such empirical risk minimization problem. We believe that it
is impossible to obtain a solver which is significantly faster than $n^4$.

In the above, the parameter $\beta$ stands for a bound on the diagonal elements of $P$ and $N$,
while the parameter $\tau$ stands for a bound on the sum of the traces of $P$ and $N$. It is easy
to verify that if $\mathcal{W}$ is $(\beta,\tau)$-decomposable then so is its convex hull,
$\mathrm{conv}(\mathcal{W})$. Throughout this paper, we assume for technical convenience that
$\beta \ge 1$.²

There is an intriguing connection between the $(\beta,\tau)$-decomposition for a rectangular
matrix $W$ and its max-norm and trace norm: the least possible $\beta$ in any
$(\beta,\tau)$-decomposition exactly equals half the max-norm of $W$ (see Theorem 21), and the
least possible $\tau$ in any $(\beta,\tau)$-decomposition exactly equals twice the trace-norm of
$W$ (see Theorem 23).

Our first contribution is a generic low regret algorithm for online matrix prediction with a
$(\beta,\tau)$-decomposable comparison class. We also assume that all the matrices in the
comparison class have bounded entries. Formally, we consider the following problem.

Online Matrix Prediction
parameters: $\beta \ge 1$, $\tau \ge 0$, $G \ge 0$
input: A set of matrices, $\mathcal{W} \subseteq [-1,1]^{m \times n}$, which is $(\beta,\tau)$-decomposable
for $t = 1, 2, \ldots, T$
    adversary supplies a pair of indices $(i_t, j_t) \in [m] \times [n]$
    learner picks $W_t \in \mathrm{conv}(\mathcal{W})$ and outputs the prediction $W_t(i_t, j_t)$
    adversary supplies a convex, $G$-Lipschitz, loss function $\ell_t : [-1,1] \to \mathbb{R}$
    learner pays $\ell_t(W_t(i_t, j_t))$

Theorem 1. There exists an efficient algorithm for Online Matrix Prediction which enjoys the
regret bound

$$ \mathrm{Regret} \le 2G \sqrt{\tau \beta \log(2p) T}, $$

where $p$ is the order of $\mathrm{sym}(W)$ for any matrix $W \in \mathcal{W}$.

The Online Matrix Prediction problem captures several specific problems considered in the 
literature, given in the next few subsections. 

2.1 Online Max-Cut 

Recall that on each round of online max-cut, the learner should decide whether two vertices of a
graph, $(i_t, j_t)$, are joined by an edge or not. The learner outputs a number
$\hat{y}_t \in [-1, 1]$ which is to be interpreted as a randomized prediction in $\{-1, 1\}$:
predict 1 with probability $\frac{1 + \hat{y}_t}{2}$ and $-1$ with the remaining probability. The
adversary then supplies the true outcome, $y_t \in \{-1, 1\}$, where $y_t = 1$ indicates the
outcome "$(i_t, j_t)$ are joined by an edge", and $y_t = -1$ the opposite outcome. The loss
suffered by the learner is the absolute loss,

$$ \ell_t(\hat{y}_t) = \tfrac{1}{2} |\hat{y}_t - y_t|, $$

which can be also interpreted as the probability that a randomized prediction according to
$\hat{y}_t$ will not equal the true outcome $y_t$.

² The condition $\beta \ge 1$ is not a serious restriction since for any $(\beta,\tau)$-decomposition
of $W$, viz. $\mathrm{sym}(W) = P - N$, we have $\beta \ge |P(i,j)|, |N(i,j)|$ for all $(i,j)$
since $P, N \succeq 0$; and so $2\beta \ge |P(i,j) - N(i,j)| = |W(i,j)|$. Thus, if we make the
reasonable assumption that there is some $W \in \mathcal{W}$ with $|W(i,j)| = 1$ for some $(i,j)$,
then $\beta \ge \frac{1}{2}$ is necessary.
The comparison class is $\mathcal{W} = \{W_A \mid A \subseteq [n]\}$, where

$$ W_A(i,j) = \begin{cases} 1 & \text{if } ((i \in A) \text{ and } (j \notin A)) \text{ or } ((j \in A) \text{ and } (i \notin A)) \\ -1 & \text{otherwise.} \end{cases} $$

That is, $W_A(i,j)$ indicates if $(i,j)$ crosses the cut defined by $A$ or not. The following
lemma (proved in Appendix C) formalizes the relationship of this online problem to the max-cut
problem:

Lemma 2. Consider an online sequence of loss functions $\{\ell_t\}$ as above. Let

$$ W^* = \arg\min_{W \in \mathcal{W}} \sum_t \ell_t(W(i_t, j_t)). $$

Then $W^* = W_A$ for the set $A$ that determines the max cut in the weighted graph over $[n]$
nodes whose weights are given by $w_{ij} = \sum_{t : (i_t, j_t) = (i,j)} y_t$ for every $(i,j)$.
A regret bound of $O(\sqrt{nT})$ is attainable for this problem via an exponential time
algorithm as follows: consider the set of all $2^n$ cuts in the graph. For each cut defined by
$A$, consider a decision rule or "expert" that predicts according to the matrix $W_A$. Standard
bounds for the experts algorithm imply the $O(\sqrt{nT})$ regret bound.

A simple way to get an efficient algorithm is to replace $\mathcal{W}$ with the class of all
matrices in $\{-1, 1\}^{n \times n}$. This leads to $n^2$ different prediction tasks, each of
which corresponds to the decision if there is an edge between two nodes, which is efficiently
solvable. However, the regret with respect to this larger comparison class scales like
$O(\sqrt{n^2 T})$.

Another popular approach for circumventing the hardness is to replace $\mathcal{W}$ with the set
of matrices whose trace-norm is bounded by $\tau = n$. However, applying the online mirror descent
algorithmic framework with an appropriate squared-Schatten norm regularization, as described in
Kakade et al. [2010], leads to a regret bound that again scales like $O(\sqrt{n^2 T})$.



In contrast, our Online Matrix Prediction algorithm yields an efficient solution for this problem,
with a regret that scales like $\sqrt{n \log(n) T}$. The regret bound of the algorithm follows
from the following:

Lemma 3. $\mathcal{W}$ is $(1, n)$-decomposable.

Combining the above with Theorem 1 yields:

Corollary 4. There is an efficient algorithm for the online max-cut problem with regret bounded
by $2\sqrt{n \log(n) T}$.

We prove (in Section 5) that the upper bound is near-optimal:

Theorem 5. For any algorithm for the online max-cut problem, there is a sequence of entries
$(i_t, j_t)$ and loss functions $\ell_t$ for $t = 1, 2, \ldots, T$ such that the regret of the
algorithm is at least $\sqrt{nT/16}$.
2.2 Collaborative Filtering with Bounded Trace Norm

In this problem, the comparison set $\mathcal{W}$ is the following set of $m \times n$ matrices
with trace norm bounded by some parameter $\tau$:

$$ \mathcal{W} := \{W \in [-1,1]^{m \times n} : \|W\|_* \le \tau\}. \qquad (1) $$

Without loss of generality we assume that $m \le n$.

As before, applying the technique of Kakade et al. [2010] leads to a regret bound that scales as
$\sqrt{\tau^2 T}$, which leads to trivial results in the most relevant case where
$\tau = \Theta(\sqrt{mn})$. In contrast, we can obtain a much better result based on the following
lemma.

Lemma 6. The class $\mathcal{W}$ given in (1) is $(\sqrt{m + n}, 2\tau)$-decomposable.

Combining the above with Theorem 1 yields:

Corollary 7. There is an efficient algorithm for the online collaborative filtering problem with
regret bounded by $2G\sqrt{2\tau \sqrt{n + m} \log(2(m + n)) T}$, assuming that for all $t$ the
loss function is $G$-Lipschitz.

This upper bound is near-optimal, as we can also show (in Section 5) the following lower bound on
the regret:

Theorem 8. For any algorithm for the online collaborative filtering problem with trace norm
bounded by $\tau$, there is a sequence of entries $(i_t, j_t)$ and $G$-Lipschitz loss functions
$\ell_t$ for $t = 1, 2, \ldots, T$ such that the regret of the algorithm is at least
$G\sqrt{\frac{1}{2} \tau \sqrt{n} T}$.

In fact, the technique used to prove the above lower bound also implies a lower bound on the
sample complexity of collaborative filtering in the batch setting (proved in Section 5).

Theorem 9. The sample complexity of learning $\mathcal{W}$ in the batch setting is
$\Omega(\tau \sqrt{n} / \epsilon^2)$. In particular, when $\tau = \Theta(n)$, the sample
complexity is $\Omega(n^{1.5} / \epsilon^2)$.

This matches an upper bound given by Shamir and Shalev-Shwartz [2011]. The question of determining
the sample complexity of $\mathcal{W}$ in the batch setting has been posed as an open problem by
Shamir (who conjectured that it scales like $n^{1.5}$) and Srebro (who conjectured that it scales
like $n^{4/3}$).

2.3 Online gambling 

In the gambling problem, we define the comparison set $\mathcal{W}$ as the following set of
$n \times n$ matrices. First, for every permutation $\pi : [n] \to [n]$, define the matrix
$W_\pi$ as:

$$ W_\pi(i,j) = \begin{cases} 1 & \text{if } \pi(i) \le \pi(j) \\ 0 & \text{otherwise.} \end{cases} $$

Then the set $\mathcal{W}$ is defined as

$$ \mathcal{W} := \{W_\pi : \pi \text{ is a permutation of } [n]\}. \qquad (2) $$

On round $t$, the adversary supplies a pair $(i_t, j_t)$ with $i_t \ne j_t$, and the learner
outputs as a prediction $\hat{y}_t = W_t(i_t, j_t) \in [0, 1]$, where we interpret $\hat{y}_t$ as
the probability that $i_t$ will beat $j_t$. The adversary then supplies the true outcome,
$y_t \in \{0, 1\}$, where $y_t = 1$ indicates the outcome "$i_t$ beats $j_t$", and $y_t = 0$ the
opposite outcome. The loss suffered by the learner is the absolute loss,

$$ \ell_t(\hat{y}_t) = |\hat{y}_t - y_t|, $$

which can be also interpreted as the probability that a randomized prediction according to
$\hat{y}_t$ will not equal the true outcome $y_t$.

As before, we tackle the problem by analyzing the decomposability of $\mathcal{W}$.

Lemma 10. The class $\mathcal{W}$ given in (2) is $(O(\log(n)), O(n \log(n)))$-decomposable.

Combining the above with Theorem 1 yields:

Corollary 11. There is an efficient algorithm for the online gambling problem with regret bounded
by $O(\sqrt{n \log^3(n) T})$.

This upper bound is near-optimal, as Kleinberg et al. [2010] essentially prove the following lower
bound on the regret:

Theorem 12. For any algorithm for the online gambling problem, there is a sequence of entries
$(i_t, j_t)$ and labels $y_t$, for $t = 1, 2, \ldots, T$, such that the regret of the algorithm
is at least $\Omega(\sqrt{n \log(n) T})$.



3 The Algorithm for Online Matrix Prediction 

In this section we prove Theorem 1 by constructing an efficient algorithm for Online Matrix
Prediction and analyzing its regret. We start by describing an algorithm for Online Linear
Optimization (OLO) over a certain set of matrices and with a certain set of linear loss
functions. We show later that the Online Matrix Prediction problem can be reduced to this online
convex optimization problem.

3.1 The $(\beta,\gamma,\tau)$-OLO problem

In this section, all matrices are in the space of real symmetric matrices of size $N \times N$,
which we denote by $\mathbb{S}^{N \times N}$.

On each round of online linear optimization, the learner chooses an element from a convex set
$\mathcal{K}$ and the adversary responds with a linear loss function. In our case, the convex set
$\mathcal{K}$ is a subset of the set of matrices with bounded trace and diagonal values:

$$ \mathcal{K} \subseteq \{X \in \mathbb{S}^{N \times N} : X \succeq 0, \; \forall i \in [N] : X_{ii} \le \beta, \; \mathrm{Tr}(X) \le \tau\}. $$

We assume for convenience that $\frac{\tau}{N} I \in \mathcal{K}$. The loss function on round $t$
is the function $X \mapsto X \bullet L_t := \sum_{i,j} X(i,j) L_t(i,j)$, where $L_t$ is a matrix
from the following set of matrices:

$$ \mathcal{L} = \{L \in \mathbb{S}^{N \times N} : L^2 = LL \text{ is a diagonal matrix s.t. } \mathrm{Tr}(L^2) \le \gamma\}. $$
We call the above setting a $(\beta,\gamma,\tau)$-OLO problem.
As usual, we analyze the regret of the algorithm

$$ \mathrm{Regret} := \sum_{t=1}^{T} X_t \bullet L_t - \min_{X \in \mathcal{K}} \sum_{t=1}^{T} X \bullet L_t, $$

where $X_1, \ldots, X_T$ are the predictions of the learner.

Below we describe and analyze an algorithm for the $(\beta,\gamma,\tau)$-OLO problem, forms of
which independently appeared in the work of Tsuda et al. [2006] and Arora and Kale [2007]. The
algorithm performs exponentiated gradient steps followed by Bregman projections onto
$\mathcal{K}$. The projection operation is defined with respect to the quantum relative entropy
divergence:

$$ \Delta(X, A) = \mathrm{Tr}(X \log(X) - X \log(A) - X + A). $$

Algorithm 1 Matrix Multiplicative Weights with Quantum Relative Entropy Projections
1: Input: $\eta$
2: Initialize $X_1 = \frac{\tau}{N} I$.
3: for $t = 1, 2, \ldots, T$ do
4:     Play the matrix $X_t$.
5:     Obtain loss matrix $L_t$.
6:     Update $X_{t+1} = \arg\min_{X \in \mathcal{K}} \Delta(X, \exp(\log(X_t) - \eta L_t))$.
7: end for

Algorithm 1 has the following regret bound (essentially following Tsuda et al. [2006] and Arora
and Kale [2007]; also proved in the appendix for completeness):

Theorem 13. Suppose $\eta$ is chosen so that $\eta \|L_t\| \le 1$ for all $t$ (where $\|L_t\|$ is
the spectral norm of $L_t$). Then

$$ \mathrm{Regret} \le \eta \sum_{t=1}^{T} X_t \bullet L_t^2 + \frac{\tau \log(N)}{\eta}. $$

Equipped with the above we are ready to prove a regret bound for $(\beta,\gamma,\tau)$-OLO.

Theorem 14. Assume $T \ge \frac{\tau \log(N)}{\beta}$. Then, applying Algorithm 1 with
$\eta = \sqrt{\frac{\tau \log(N)}{\beta \gamma T}}$ on a $(\beta,\gamma,\tau)$-OLO problem yields
an efficient algorithm whose regret is at most $2\sqrt{\beta \gamma \tau \log(N) T}$.

Proof. Clearly, Algorithm 1 can be implemented in polynomial time since the update of step 6 is
a convex optimization problem. To analyze the regret of the algorithm we rely on Theorem 13. By
the definition of $\mathcal{K}$ and $\mathcal{L}$, we get that
$X_t \bullet L_t^2 \le \beta \gamma$. Hence, the regret bound becomes

$$ \mathrm{Regret} \le \eta \beta \gamma T + \frac{\tau \log(N)}{\eta}. $$

Substituting the value of $\eta$, we get the stated regret bound. One technical condition is that
the above regret bound holds as long as $\eta$ is chosen small enough so that for all $t$, we have
$\eta \|L_t\| \le 1$. Now $\|L_t\| \le \|L_t\|_F = \sqrt{\mathrm{Tr}(L_t^2)} \le \sqrt{\gamma}$.
Thus, for $T \ge \frac{\tau \log(N)}{\beta}$, the technical condition is satisfied. □
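For readers who want the substitution spelled out, the following worked step (using only the quantities named in Theorem 14) shows that the chosen step size balances the two terms of the bound:

```latex
\[
\eta = \sqrt{\frac{\tau \log N}{\beta\gamma T}}
\quad\Longrightarrow\quad
\eta\,\beta\gamma T = \sqrt{\beta\gamma\tau \log(N)\,T}
\quad\text{and}\quad
\frac{\tau \log N}{\eta} = \sqrt{\beta\gamma\tau \log(N)\,T},
\]
\[
\text{so that}\quad
\eta\,\beta\gamma T + \frac{\tau \log N}{\eta} = 2\sqrt{\beta\gamma\tau \log(N)\,T},
\qquad
\eta\sqrt{\gamma} = \sqrt{\frac{\tau \log N}{\beta T}} \le 1
\iff T \ge \frac{\tau \log N}{\beta}.
\]
```

The last equivalence is exactly the technical condition on the spectral norm of the loss matrices.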



3.2 An Algorithm for the Online Matrix Prediction Problem 



In this section we describe a reduction from the Online Matrix Prediction problem (with a
$(\beta,\tau)$-decomposable comparison class) to a $(\beta, 4G^2, \tau)$-OLO problem with
$N = 2p$. The regret bound of the derived algorithm will follow directly from Theorem 14.

We now describe the reduction. To simplify our notation, let $q$ be $m$ if $\mathcal{W}$ contains
non-symmetric matrices and $q = 0$ otherwise. Note that the definition of $\mathrm{sym}(W)$
implies that for a pair of indices $(i, j) \in [m] \times [n]$, their corresponding indices in
$\mathrm{sym}(W)$ are $(i, j + q)$.

Given any matrix $W \in \mathcal{W}$ we embed its symmetrization $\mathrm{sym}(W)$ (which has size
$p \times p$) into the set of $2p \times 2p$ positive semidefinite matrices as follows. Since $W$
admits a $(\beta,\tau)$-decomposition, there exist $P, N \succeq 0$ such that
$\mathrm{sym}(W) = P - N$, $\mathrm{Tr}(P) + \mathrm{Tr}(N) \le \tau$, and for all $i \in [p]$,
$P(i,i), N(i,i) \le \beta$. The embedding of $W$ in $\mathbb{S}^{2p \times 2p}$, denoted
$\phi(W)$, is defined to be the matrix³

$$ \phi(W) = \begin{pmatrix} P & 0 \\ 0 & N \end{pmatrix}. $$

It is easy to verify that $\phi(W)$ belongs to the convex set $\mathcal{K}$ defined below:

$$ \mathcal{K} := \left\{ X \in \mathbb{S}^{2p \times 2p} \ \text{s.t.} \ \begin{array}{l} X \succeq 0 \\ \forall i \in [2p] : X(i,i) \le \beta \\ \mathrm{Tr}(X) \le \tau \\ \forall (i,j) \in [m] \times [n] : (X(i, j + q) - X(p + i, p + j + q)) \in [-1, 1] \end{array} \right\} \qquad (3) $$

We shall run the OLO algorithm with the set $\mathcal{K}$. On round $t$, if the adversary gives
the pair $(i_t, j_t)$, then we predict

$$ \hat{y}_t = X_t(i_t, j_t + q) - X_t(p + i_t, p + j_t + q). $$

The last constraint defining $\mathcal{K}$ simply ensures that $\hat{y}_t \in [-1, 1]$. While this
constraint makes the quantum relative entropy projection onto $\mathcal{K}$ more complex, in
Section 6 we show how we can leverage the knowledge of $(i_t, j_t)$ to get a very fast
implementation.

Next we describe how to choose the loss matrices $L_t$ using the subderivative of $\ell_t$. Given
the loss function $\ell_t$, let $g$ be a subderivative of $\ell_t$ at $\hat{y}_t$. Since $\ell_t$
is convex and $G$-Lipschitz, we have that $|g| \le G$. Define
$L_t \in \mathbb{S}^{2p \times 2p}$ as follows:

$$ L_t(i,j) = \begin{cases} g & \text{if } (i,j) = (i_t, j_t + q) \text{ or } (i,j) = (j_t + q, i_t) \\ -g & \text{if } (i,j) = (p + i_t, p + j_t + q) \text{ or } (i,j) = (p + j_t + q, p + i_t) \\ 0 & \text{otherwise.} \end{cases} \qquad (4) $$

Note that $L_t^2$ is a diagonal matrix, whose only non-zero diagonal entries are $(i_t, i_t)$,
$(j_t + q, j_t + q)$, $(p + i_t, p + i_t)$, and $(p + j_t + q, p + j_t + q)$, all equalling
$g^2$. Hence, $\mathrm{Tr}(L_t^2) = 4g^2 \le 4G^2$.
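The structure of the loss matrix can be checked numerically on a small instance. The sketch below is illustrative only: the helper names `loss_matrix` and `matmul`, the 0-based indexing, and the particular values of m, n, g and the queried pair are assumptions, not part of the paper. It builds the matrix of equation (4) and confirms that its square is diagonal with trace 4g².

```python
def loss_matrix(i_t, j_t, g, p, q):
    """Build the 2p x 2p loss matrix of equation (4), using 0-based indices."""
    size = 2 * p
    L = [[0.0] * size for _ in range(size)]
    L[i_t][j_t + q] = L[j_t + q][i_t] = g
    L[p + i_t][p + j_t + q] = L[p + j_t + q][p + i_t] = -g
    return L


def matmul(A, B):
    """Plain square matrix product."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Non-symmetric case: W is m x n, so p = m + n and q = m.
m, n = 2, 3
p, q = m + n, m
g = 0.5
L = loss_matrix(1, 2, g, p, q)
L2 = matmul(L, L)

size = 2 * p
off_diagonal_max = max(abs(L2[a][b]) for a in range(size) for b in range(size) if a != b)
trace = sum(L2[a][a] for a in range(size))
nonzero_diag = [a for a in range(size) if L2[a][a] != 0]
```

In this instance the non-zero diagonal positions are i_t, j_t + q, p + i_t and p + j_t + q (0-based), each carrying g².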



³ Note that this mapping depends on the choice of $P$ and $N$ for each matrix
$W \in \mathcal{W}$. We make an arbitrary choice for each $W$.
To summarize, the Online Matrix Prediction algorithm will be as follows:

Algorithm 2 Matrix Multiplicative Weights for Online Matrix Prediction
1: Input: $\beta, \tau, G, m, n, p, q$ (see text for definitions)
2: Set: $\gamma = 4G^2$, $N = 2p$, $\eta = \sqrt{\frac{\tau \log(2p)}{4G^2 \beta T}}$
3: Let $\mathcal{K}$ be as defined in (3)
4: Initialize $X_1 = \frac{\tau}{N} I$.
5: for $t = 1, 2, \ldots, T$ do
6:     Adversary supplies a pair of indices $(i_t, j_t) \in [m] \times [n]$.
7:     Predict $\hat{y}_t = X_t(i_t, j_t + q) - X_t(p + i_t, p + j_t + q)$.
8:     Obtain loss function $\ell_t : [-1, 1] \to \mathbb{R}$ and pay $\ell_t(\hat{y}_t)$.
9:     Let $g$ be a sub-derivative of $\ell_t$ at $\hat{y}_t$.
10:    Let $L_t$ be as defined in (4).
11:    Update $X_{t+1} = \arg\min_{X \in \mathcal{K}} \Delta(X, \exp(\log(X_t) - \eta L_t))$.
12: end for

To analyze the algorithm, note that for any $W \in \mathcal{W}$,

$$ \phi(W) \bullet L_t = 2g(P(i_t, j_t + q) - N(i_t, j_t + q)) = 2g W(i_t, j_t), $$

and

$$ X_t \bullet L_t = 2g(X_t(i_t, j_t + q) - X_t(p + i_t, p + j_t + q)) = 2g \hat{y}_t. $$

So for any $W \in \mathcal{W}$, we have

$$ X_t \bullet L_t - \phi(W) \bullet L_t = 2g(\hat{y}_t - W(i_t, j_t)) \ge 2(\ell_t(\hat{y}_t) - \ell_t(W(i_t, j_t))), $$

by the convexity of $\ell_t(\cdot)$. This implies that for any $W \in \mathcal{W}$,

$$ \sum_{t=1}^{T} \ell_t(\hat{y}_t) - \ell_t(W(i_t, j_t)) \le \frac{1}{2} \left[ \sum_{t=1}^{T} X_t \bullet L_t - \phi(W) \bullet L_t \right] \le \frac{1}{2} \mathrm{Regret}_{\mathrm{OLO}}. $$

Thus, the regret of the Online Matrix Prediction problem is at most half the regret in the
$(\beta, 4G^2, \tau)$-OLO problem.
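The identity relating the embedded comparator to the original matrix entry can be sanity-checked on a tiny symmetric instance. The sketch below is an illustration under stated assumptions: it uses a cut matrix on three nodes with the decomposition P = 0, N = w wᵀ (so p = n and q = 0), 0-based indices, and hypothetical helper names `block_diag` and `dot`.

```python
def block_diag(P, N):
    """Embed P and N as the diagonal blocks of a 2p x 2p matrix (the map phi)."""
    p = len(P)
    X = [[0.0] * (2 * p) for _ in range(2 * p)]
    for a in range(p):
        for b in range(p):
            X[a][b] = P[a][b]
            X[p + a][p + b] = N[a][b]
    return X


def dot(X, L):
    """Entrywise inner product of two square matrices of equal size."""
    size = len(X)
    return sum(X[a][b] * L[a][b] for a in range(size) for b in range(size))


# Cut matrix W_A = -w w^T on n = 3 nodes with A = {0}; it is symmetric, so q = 0.
w = [1, -1, -1]
n = p = 3
q = 0
W = [[-a * b for b in w] for a in w]
P = [[0.0] * n for _ in range(n)]
N = [[float(a * b) for b in w] for a in w]
phi = block_diag(P, N)

# Loss matrix for the queried pair (i_t, j_t) with subgradient g.
i_t, j_t, g = 0, 1, 0.25
L = [[0.0] * (2 * p) for _ in range(2 * p)]
L[i_t][j_t + q] = L[j_t + q][i_t] = g
L[p + i_t][p + j_t + q] = L[p + j_t + q][p + i_t] = -g

lhs = dot(phi, L)             # phi(W) . L_t
rhs = 2 * g * W[i_t][j_t]     # 2 g W(i_t, j_t)
```

Both sides evaluate to the same number because the upper block contributes 2g·P(i_t, j_t + q) and the lower block contributes −2g·N(i_t, j_t + q).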

3.2.1 Proof of Theorem 1

Following our reduction, we can now appeal to Theorem 14. For
$T \ge \frac{\tau \log(2p)}{\beta}$, the bound of Theorem 14 applies and gives a regret bound of
$2G\sqrt{\tau \beta \log(2p) T}$. For $T < \frac{\tau \log(2p)}{\beta}$, note that in any round,
the regret can be at most $2G$, since the subderivatives of the loss functions are bounded in
absolute value by $G$ and the domain is $[-1, 1]$, so the regret is bounded by
$2GT < 2G\sqrt{\tau \beta \log(2p) T}$ since $\beta \ge 1$. Thus, we have proved the regret bound
stated in Theorem 1.
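The constants in this proof can be traced explicitly. The following worked step is only a restatement of the argument with the parameters of the reduction ($\gamma = 4G^2$, $N = 2p$) plugged in:

```latex
\[
\mathrm{Regret}
\;\le\; \tfrac{1}{2}\,\mathrm{Regret}_{\mathrm{OLO}}
\;\le\; \tfrac{1}{2}\cdot 2\sqrt{\beta \cdot 4G^2 \cdot \tau \log(2p)\,T}
\;=\; 2G\sqrt{\tau\beta \log(2p)\,T}.
\]
\[
\text{For short horizons:}\quad
T < \frac{\tau \log(2p)}{\beta}
\;\Longrightarrow\;
T^2 < \frac{\tau \log(2p)}{\beta}\,T \le \tau\beta \log(2p)\,T
\;\Longrightarrow\;
2GT < 2G\sqrt{\tau\beta \log(2p)\,T},
\]
```

where the middle inequality of the second line uses $1/\beta \le \beta$, i.e. the assumption $\beta \ge 1$.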
4 Decomposability Proofs

In this section we prove the decomposability results for the comparison classes corresponding to
max-cut, collaborative filtering, and gambling. All three decompositions we give are optimal up
to constant factors.

4.1 Proof of Lemma 3 (max-cut)

We need to show that every matrix $W_A \in \mathcal{W}$ admits a $(1, n)$-decomposition. We can
rewrite $W_A = -w_A w_A^\top$ where $w_A \in \mathbb{R}^n$ is the vector such that

$$ w_A(i) = \begin{cases} 1 & \text{if } i \in A \\ -1 & \text{otherwise.} \end{cases} $$

Since $W_A$ is already symmetric, $\mathrm{sym}(W_A) = W_A = -w_A w_A^\top$. Thus we can choose
$P = 0$ and $N = w_A w_A^\top$. These are positive semidefinite matrices with diagonals bounded
by 1 and sum of traces equal to $n$, which concludes the proof. Since
$\mathrm{Tr}(w_A w_A^\top) = n$, this $(1, n)$-decomposition is optimal.
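The decomposition used in this proof is easy to verify on a small example. The following sketch is illustrative: the function name `cut_matrix`, the choice n = 5 and the set A = {0, 3} (0-based) are assumptions made for the example.

```python
def cut_matrix(n, A):
    """W_A(i, j) = 1 if exactly one of i, j lies in A, and -1 otherwise."""
    return [[1 if (i in A) != (j in A) else -1 for j in range(n)] for i in range(n)]


n, A = 5, {0, 3}
w = [1 if i in A else -1 for i in range(n)]
W = cut_matrix(n, A)

# Decomposition from the proof: P = 0 and N = w w^T (rank one, hence PSD).
P = [[0] * n for _ in range(n)]
N = [[w[i] * w[j] for j in range(n)] for i in range(n)]

difference_ok = all(W[i][j] == P[i][j] - N[i][j] for i in range(n) for j in range(n))
max_diagonal = max(max(P[i][i], N[i][i]) for i in range(n))
trace_sum = sum(P[i][i] + N[i][i] for i in range(n))
```

Here the diagonal bound is 1 and the sum of traces is n, matching the parameters (1, n) of Lemma 3.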

4.2 Proof of Lemma 6 (collaborative filtering)

We need to show that every matrix $W \in \mathcal{W}$, i.e. an $m \times n$ matrix over
$[-1, 1]$ with $\|W\|_* \le \tau$, admits a $(\sqrt{m + n}, 2\tau)$-decomposition. The
$(\sqrt{m + n}, 2\tau)$-decomposition of $W$ is a direct consequence of the following theorem,
setting $Y = \mathrm{sym}(W)$, with $p = m + n$, and the fact that
$\|\mathrm{sym}(W)\|_* = 2\|W\|_*$ (see Lemma 19).

Theorem 15. Let $Y$ be a $p \times p$ symmetric matrix with entries in $[-1, 1]$. Then $Y$ can be
written as $Y = P - N$ where $P$ and $N$ are both positive semidefinite matrices with diagonal
entries bounded by $\sqrt{p}$, and $\mathrm{Tr}(P) + \mathrm{Tr}(N) = \|Y\|_*$.

Proof. Let

$$ Y = \sum_i \lambda_i v_i v_i^\top $$

be the eigenvalue decomposition of $Y$. We now show that

$$ P = \sum_{i : \lambda_i \ge 0} \lambda_i v_i v_i^\top \quad \text{and} \quad N = \sum_{i : \lambda_i < 0} -\lambda_i v_i v_i^\top $$

satisfy the required conditions. Clearly
$\mathrm{Tr}(P) + \mathrm{Tr}(N) = \sum_i |\lambda_i| = \|Y\|_*$. Define
$\mathrm{abs}(Y) = P + N = \sum_i |\lambda_i| v_i v_i^\top$. Note that

$$ \mathrm{abs}(Y)^2 = \sum_i \lambda_i^2 v_i v_i^\top = Y^2. $$

We now show that all entries (and in particular, the diagonal entries) of $\mathrm{abs}(Y)$ are
bounded in magnitude by $\sqrt{p}$. Since $P$ and $N$ are both positive semidefinite, their
diagonal elements must be non-negative, so we conclude that the diagonal entries of $P$ and $N$
are bounded by $\sqrt{p}$ as well.

Since all the entries of $Y$ are bounded in magnitude by 1, it follows that all entries of $Y^2$
are bounded in magnitude by $p$. In particular, the diagonal entries of $Y^2$ are bounded by $p$.
Since these diagonal entries are equal to the squared lengths of the rows of $\mathrm{abs}(Y)$,
it follows that each entry of $\mathrm{abs}(Y)$ is bounded in magnitude by $\sqrt{p}$. □
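A small worked instance may help make the construction concrete. The $2 \times 2$ matrix below is chosen here purely for illustration; it has eigenvalues $\pm 1$ with eigenvectors $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$:

```latex
\[
Y = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},
\qquad
P = \frac{1}{2}\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad
N = \frac{1}{2}\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix},
\qquad
P - N = Y .
\]
\[
\mathrm{Tr}(P) + \mathrm{Tr}(N) = 1 + 1 = 2 = \|Y\|_*,
\qquad
\mathrm{abs}(Y) = P + N = I,
\qquad
\mathrm{abs}(Y)^2 = I = Y^2 .
\]
```

The diagonal entries of $P$ and $N$ equal $1/2$, comfortably below the bound $\sqrt{p} = \sqrt{2}$ guaranteed by the theorem.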
This decomposition is optimal up to constant factors. Consider the matrix $W$ formed by taking
$m = \frac{\tau}{\sqrt{n}}$ rows of an $n \times n$ Hadamard matrix. In Theorem 20 (proved in
Appendix D), we prove that any $(\beta', \tau')$-decomposition of $\mathrm{sym}(W)$ must have
$\beta' \tau' = \Omega(\tau \sqrt{n})$. Since the regret bound depends on the product
$\beta' \tau'$, we conclude that the decomposition obtained from Theorem 15 is optimal up to a
constant factor.



4.3 Proof of Lemma 10 (gambling)

We need to show that every matrix $W \in \mathcal{W}$, i.e. an $n \times n$ matrix $W_\pi$ for
some permutation $\pi : [n] \to [n]$, admits a $(O(\log(n)), O(n \log(n)))$-decomposition. One
minor change that needs to be made to Algorithm 2 is that the last constraint in (3) needs to be
changed to

$$ \forall (i, j) \in [n] \times [n] : (X(i, j + q) - X(p + i, p + j + q)) \in [0, 1], $$

to ensure that the prediction lies in $[0, 1]$ rather than $[-1, 1]$. The analysis remains
intact, and so does the regret bound.

We now give the decomposition. The following upper triangular matrix $T$ plays a pivotal role:

$$ T(i,j) = \begin{cases} 1 & \text{if } i \le j \\ 0 & \text{otherwise.} \end{cases} $$

The reason this matrix is so important is that any matrix $W_\pi$ is obtained by permuting the
rows and columns of $T$. In particular, let $P_\pi$ be the permutation matrix defined by the
permutation $\pi$, i.e.

$$ P_\pi(i,j) = \begin{cases} 1 & \text{if } j = \pi(i) \\ 0 & \text{otherwise.} \end{cases} $$

Then it is easy to check that

$$ W_\pi = P_\pi T P_\pi^\top. $$

Using this fact, and writing $Q_\pi := \begin{pmatrix} P_\pi & 0 \\ 0 & P_\pi \end{pmatrix}$, we get

$$ Q_\pi \, \mathrm{sym}(T) \, Q_\pi^\top = \begin{pmatrix} P_\pi & 0 \\ 0 & P_\pi \end{pmatrix} \begin{pmatrix} 0 & T \\ T^\top & 0 \end{pmatrix} \begin{pmatrix} P_\pi^\top & 0 \\ 0 & P_\pi^\top \end{pmatrix} = \begin{pmatrix} 0 & P_\pi T P_\pi^\top \\ P_\pi T^\top P_\pi^\top & 0 \end{pmatrix} = \begin{pmatrix} 0 & W_\pi \\ W_\pi^\top & 0 \end{pmatrix} = \mathrm{sym}(W_\pi). $$

Now, note that $Q_\pi$ is a permutation matrix (viz. the one defined by the permutation
$\pi' : [2n] \to [2n]$ defined as $\pi'(i) = \pi(i)$ for $1 \le i \le n$, and
$\pi'(i) = \pi(i - n) + n$ for $n < i \le 2n$). Thus, if $T$ admits a
$(\beta,\tau)$-decomposition, $\mathrm{sym}(T) = P - N$, then

$$ \mathrm{sym}(W_\pi) = Q_\pi \, \mathrm{sym}(T) \, Q_\pi^\top = Q_\pi P Q_\pi^\top - Q_\pi N Q_\pi^\top $$

is a $(\beta,\tau)$-decomposition for $\mathrm{sym}(W_\pi)$. This is because the diagonal entries
of $Q_\pi P Q_\pi^\top$ (resp. $Q_\pi N Q_\pi^\top$) are simply a permutation (viz. $\pi'$) of the
diagonal entries of $P$ (resp. $N$). Since $A B A^\top \succeq 0$ if $B \succeq 0$ for any matrix
$A$, the matrices $Q_\pi P Q_\pi^\top$ and $Q_\pi N Q_\pi^\top$ are both positive semidefinite.
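The claim that every permutation matrix of the comparison class is a row-and-column permutation of the triangular matrix can be checked exhaustively for small n. The sketch below is illustrative: the helper names are hypothetical, indices are 0-based, and it assumes the convention that both the triangular matrix and the permutation matrices of the class use a non-strict inequality (the same check passes if both use a strict one).

```python
from itertools import permutations


def triangular(n):
    """T(i, j) = 1 if i <= j, else 0."""
    return [[1 if i <= j else 0 for j in range(n)] for i in range(n)]


def perm_matrix(pi):
    """P_pi(i, j) = 1 if j == pi[i], else 0."""
    n = len(pi)
    return [[1 if j == pi[i] else 0 for j in range(n)] for i in range(n)]


def w_pi(pi):
    """W_pi(i, j) = 1 if pi[i] <= pi[j], else 0."""
    n = len(pi)
    return [[1 if pi[i] <= pi[j] else 0 for j in range(n)] for i in range(n)]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


n = 4
T = triangular(n)
# Check W_pi == P_pi T P_pi^T for every permutation of n elements.
all_match = all(
    matmul(matmul(perm_matrix(pi), T), transpose(perm_matrix(pi))) == w_pi(pi)
    for pi in permutations(range(n))
)
```

The check works because entry (i, j) of the conjugated matrix is T(π(i), π(j)), which is 1 exactly when π(i) ≤ π(j).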

So now we show that $T$ admits a $(O(\log(n)), O(n \log(n)))$-decomposition. For convenience, we
assume that $n$ is a power of 2, i.e. $n = 2^k$ for some integer $k \ge 0$. For $n$ that are not a
power of 2, we can readily obtain a decomposition by the following observation: if we take the
smallest power of 2 that is larger than $n$, say $2^k$, and consider the symmetrized triangular
matrix for $2^k$, then $\mathrm{sym}(T)$ can be expressed as a principal submatrix of it. Then
taking the corresponding principal submatrices from the decomposition for the triangular matrix
for $2^k$ we obtain a decomposition for $n$. This uses the fact that principal submatrices of
positive semidefinite matrices are positive semidefinite as well.

Theorem 16. Let $n = 2^k$ for some integer $k \ge 0$. Then $T$ admits a
$(k + 1, 4n(k + 1))$-decomposition.

Proof. We show that $\mathrm{sym}(T)$ can be written as a difference of positive semidefinite
matrices with diagonals bounded by $k + 1$. The bound on the sum of traces, $4n(k + 1)$, of the
two matrices follows trivially.

We use a recursive construction. Let the triangular matrix for re = 2 k be denoted by T&. For 
k = 0, the following is a decomposition for Tq with diagonals bounded by 1: 



sym(To) 



" 


1 " 






1 " 




" 1 


" 


1 







1 


1 







1 



So now assume that k > and we have a decomposition for T/%_i with diagonals bounded by k, 
i.e. 

T fc i 
Tj_! 



sym(T fc _i) 



P -N, 



where P, N y 0, and for all i € [2 k ], P(i, i),N(i, i) < k. We need the following block decomposition 
of P and N into contiguous 2 k ~ 1 x 2 k ~ 1 blocks as follows: 



r P A 




and N = 






pC 


pD 


N c 





Then we have the following decomposition of sym(Tfc). All the blocks in the decomposition below 
are of size 2 k ~ 1 x 2 k ~ 1 . 



sym(T fc ) 



1 





10 



+ 















T T 

L k-1 



T fc _i 























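Now, consider the following decompositions of the two matrices above as a difference of positive
semidefinite matrices. For the first matrix, the diagonals in the decomposition are bounded by 1
(each block $\mathbf{1}$ is a $2^{k-1} \times 2^{k-1}$ all-ones block):

$$\begin{bmatrix} 0 & 0 & 0 & \mathbf{1} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \mathbf{1} & 0 & 0 & 0 \end{bmatrix} = \begin{bmatrix} \mathbf{1} & 0 & 0 & \mathbf{1} \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ \mathbf{1} & 0 & 0 & \mathbf{1} \end{bmatrix} - \begin{bmatrix} \mathbf{1} & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & \mathbf{1} \end{bmatrix}.$$

For the second matrix, the diagonals in the decomposition are bounded by $k$:

$$\begin{bmatrix} 0 & 0 & T_{k-1} & 0 \\ 0 & 0 & 0 & T_{k-1} \\ T_{k-1}^\top & 0 & 0 & 0 \\ 0 & T_{k-1}^\top & 0 & 0 \end{bmatrix} = \begin{bmatrix} P^A & 0 & P^B & 0 \\ 0 & P^A & 0 & P^B \\ P^C & 0 & P^D & 0 \\ 0 & P^C & 0 & P^D \end{bmatrix} - \begin{bmatrix} N^A & 0 & N^B & 0 \\ 0 & N^A & 0 & N^B \\ N^C & 0 & N^D & 0 \\ 0 & N^C & 0 & N^D \end{bmatrix}.$$
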

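It is easy to verify that the matrices in the decomposition above are positive semidefinite, since
each is a sum of two positive semidefinite matrices. For example:

$$\begin{bmatrix} P^A & 0 & P^B & 0 \\ 0 & P^A & 0 & P^B \\ P^C & 0 & P^D & 0 \\ 0 & P^C & 0 & P^D \end{bmatrix} = \begin{bmatrix} P^A & 0 & P^B & 0 \\ 0 & 0 & 0 & 0 \\ P^C & 0 & P^D & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix} + \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & P^A & 0 & P^B \\ 0 & 0 & 0 & 0 \\ 0 & P^C & 0 & P^D \end{bmatrix}.$$
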

Adding the two decompositions, we get a decomposition for $\mathrm{sym}(T_k)$ as a difference of two positive
semidefinite matrices. The diagonal entries of these two matrices are bounded by $k + 1$, as required.

□ 
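The recursion in the proof of Theorem 16 can be checked mechanically for small $k$. The following is a minimal sketch, not part of the original construction as published: the helper names (`triangular`, `sym`, `decompose`, `is_psd`) are illustrative, the base case is assumed to be $P = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$, $N = I$, and positive semidefiniteness is tested by exact symmetric elimination over rationals.

```python
from fractions import Fraction

def triangular(n):
    """Upper-triangular all-ones matrix T (T[i][j] = 1 iff i <= j)."""
    return [[1 if i <= j else 0 for j in range(n)] for i in range(n)]

def sym(W):
    """sym(W) = [[0, W], [W^T, 0]] for a square matrix W."""
    n = len(W)
    top = [[0] * n + W[i] for i in range(n)]
    bottom = [[W[i][j] for i in range(n)] + [0] * n for j in range(n)]
    return top + bottom

def decompose(k):
    """Return (P, N) with sym(T_k) = P - N, following the recursion of the proof."""
    if k == 0:
        return [[1, 1], [1, 1]], [[1, 0], [0, 1]]   # assumed base case
    Pp, Np = decompose(k - 1)       # decomposition of sym(T_{k-1}), size 2^k
    h = 2 ** (k - 1)                # block size
    size = 4 * h
    P = [[0] * size for _ in range(size)]
    N = [[0] * size for _ in range(size)]
    # Second matrix: copies of the previous P and N on block pairs (1,3) and (2,4).
    for blocks in ((0, 2), (1, 3)):
        for bi in range(2):
            for bj in range(2):
                for i in range(h):
                    for j in range(h):
                        r, c = blocks[bi] * h + i, blocks[bj] * h + j
                        P[r][c] += Pp[bi * h + i][bj * h + j]
                        N[r][c] += Np[bi * h + i][bj * h + j]
    # First matrix: all-ones blocks at the corners (P) and on the diagonal (N).
    for i in range(h):
        for j in range(h):
            for bi, bj in ((0, 0), (0, 3), (3, 0), (3, 3)):
                P[bi * h + i][bj * h + j] += 1
            for b in (0, 3):
                N[b * h + i][b * h + j] += 1
    return P, N

def is_psd(M):
    """Exact PSD test for a symmetric rational matrix via Schur complements."""
    A = [[Fraction(x) for x in row] for row in M]
    n = len(A)
    for p in range(n):
        if A[p][p] < 0:
            return False
        if A[p][p] == 0:
            # A zero pivot in a PSD matrix forces the rest of its row to vanish.
            if any(A[p][j] != 0 for j in range(p + 1, n)):
                return False
            continue
        for i in range(p + 1, n):
            f = A[i][p] / A[p][p]
            for j in range(p + 1, n):
                A[i][j] -= f * A[p][j]
    return True
```

For each small $k$, one can then confirm that `decompose(k)` returns matrices whose difference is `sym(triangular(2 ** k))`, whose diagonals are at most $k+1$, and which pass `is_psd`.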



This decomposition is optimal up to constant factors. This is because the singular values of $T$ are
$\frac{1}{2\sin\left(\frac{(2k-1)\pi}{4n+2}\right)}$ for $k = 1, 2, \ldots, n$ (see Elkies [2011]). This implies that
$\|T\|_* = \Theta(n \log n)$. Thus, the best $\beta$ one can get is $\Theta(\log n)$, and the best $\tau$ is $\Theta(n \log n)$.

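The growth of the trace norm can be examined numerically. The sketch below assumes the closed form $\sigma_k = 1/\big(2\sin\frac{(2k-1)\pi}{4n+2}\big)$ for the singular values of the $n \times n$ all-ones upper-triangular matrix; the function names are illustrative. It uses two sanity checks that any valid list of singular values of $T$ must satisfy: the squares sum to $\|T\|_F^2 = n(n+1)/2$, and the product equals $|\det T| = 1$.

```python
import math

def singular_values(n):
    """Assumed closed form: 1 / (2 sin((2k-1) pi / (4n+2))) for k = 1..n."""
    return [1.0 / (2.0 * math.sin((2 * k - 1) * math.pi / (4 * n + 2)))
            for k in range(1, n + 1)]

def trace_norm(n):
    """Sum of the singular values, i.e. the trace norm of T."""
    return sum(singular_values(n))

for n in (4, 16, 64, 256, 1024):
    s = singular_values(n)
    frob_sq = sum(x * x for x in s)          # compare with n(n+1)/2
    ratio = trace_norm(n) / (n * math.log(n))
    print(n, frob_sq, n * (n + 1) / 2, ratio)
```

The printed ratio stays bounded away from zero and infinity over this range, which is consistent with $\|T\|_* = \Theta(n \log n)$.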



5 Lower bounds

In this section we prove the lower bounds stated in Section 2.

5.1 Online Max Cut

We prove Theorem 5, which we restate here for convenience:

Theorem 5 restated: For any algorithm for the online max cut problem, there is a sequence
of entries $(i_t, j_t)$ and loss functions $\ell_t$ for $t = 1, 2, \ldots, T$ such that the regret of the algorithm
is at least $\sqrt{nT/16}$.

Proof. Consider the following stochastic adversary. Divide up the time period $T$ into $n/2$ equal-sized
intervals $T_i$, for $i \in [n/2]$,$^4$ corresponding to the $n/2$ pairs of indices $(i, i + n/2)$ for $i \in [n/2]$.
For every $i \in [n/2]$ and for each $t \in T_i$, the adversary sets $(i_t, j_t) = (i, i + n/2)$ and $y_t$ to be a
Rademacher random variable independent of all other such variables. Clearly, the expected loss
of any algorithm for the online max cut problem equals $T/2$.

Now, define the following subset of vertices $A$: for every $i \in [n/2]$, consider $S_i = \sum_{t \in T_i} y_t$. If
$S_i < 0$, include both $i, i + n/2 \in A$, else only include $i \in A$. By construction, the matrix $W_A$ has
the following property for all $i \in [n/2]$:

$$W_A(i, i + n/2) = \mathrm{sgn}(S_i).$$

$^4$ We assume for convenience that $n/2$ and $2T/n$ are integers.

Using the definition of $\ell_t$ and the fact that $|T_i| = 2T/n$, we obtain

$$\mathbb{E}\Big[\sum_{t \in T_i} \ell_t(W_A(i, i + n/2))\Big] = \mathbb{E}\Big[\sum_{t \in T_i} \tfrac{1}{2}\,|\mathrm{sgn}(S_i) - y_t|\Big] = \mathbb{E}\Big[\sum_{t \in T_i} \tfrac{1}{2}\,(1 - \mathrm{sgn}(S_i)\, y_t)\Big] = \mathbb{E}\Big[\frac{T}{n} - \frac{1}{2}|S_i|\Big] \le \frac{T}{n} - \frac{1}{2}\sqrt{\frac{T}{n}},$$

where we used Khintchine's inequality: if $X$ is a sum of $k$ independent Rademacher random variables,
then $\mathbb{E}[|X|] \ge \sqrt{k/2}$. Summing up over all $i \in [n/2]$, we get that

$$\mathbb{E}\Big[\sum_{t=1}^{T} \ell_t(W_A(i_t, j_t))\Big] \le \frac{n}{2}\Big(\frac{T}{n} - \frac{1}{2}\sqrt{\frac{T}{n}}\Big) = \frac{T}{2} - \sqrt{\frac{nT}{16}}.$$

Hence the expected regret of the algorithm is at least $\sqrt{nT/16}$. In particular, there is a setting of the
$y_t$ variables so that the regret of the algorithm is at least $\sqrt{nT/16}$.

□
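The quantities in this argument can be computed exactly for concrete $n$ and $T$. The sketch below is illustrative only; it assumes the loss $\ell_t(\hat{y}) = \frac{1}{2}|\hat{y} - y_t|$, so that every algorithm has expected loss $T/2$ against Rademacher labels, and the function names are hypothetical. It evaluates $\mathbb{E}|X|$ for a sum of $k$ Rademacher variables from the binomial distribution and compares the resulting expected regret with $\sqrt{nT/16}$.

```python
import math

def expected_abs_rademacher_sum(k):
    """Exact E|X| for X a sum of k independent Rademacher variables."""
    total = 0
    for heads in range(k + 1):
        total += math.comb(k, heads) * abs(2 * heads - k)
    return total / 2 ** k

def comparator_expected_loss(n, T):
    """Expected loss of W_A: n/2 intervals, each of length 2T/n."""
    length = 2 * T // n
    per_interval = length / 2 - expected_abs_rademacher_sum(length) / 2
    return (n // 2) * per_interval

def expected_regret(n, T):
    """Expected regret of any algorithm against W_A under the stochastic adversary."""
    return T / 2 - comparator_expected_loss(n, T)

# Example with n/2 and 2T/n integers, as the proof assumes.
print(expected_regret(8, 64), math.sqrt(8 * 64 / 16))
```

The exact expected regret is slightly larger than $\sqrt{nT/16}$ because Khintchine's inequality is not tight for most $k$.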



5.2 Online Collaborative Filtering with Bounded Trace Norm

We start with the proof of Theorem 8, which we restate here for convenience:

Theorem 8 restated: For any algorithm for the online collaborative filtering problem with trace
norm bounded by $\tau$, there is a sequence of entries $(i_t, j_t)$ and loss functions $\ell_t$ for $t = 1, 2, \ldots, T$
such that the regret of the algorithm is at least $G\sqrt{\frac{1}{2}\tau\sqrt{n}\,T}$.

Proof. First, we may assume that $\tau \le m\sqrt{n}$: this is because for any matrix $W \in [-1, 1]^{m \times n}$, we
have

$$\|W\|_* \le \sqrt{\mathrm{rank}(W)}\;\|W\|_F \le \sqrt{m} \cdot \sqrt{mn} = m\sqrt{n},$$

since $\mathrm{rank}(W) \le m$. So now we focus on the sub-matrix formed by the first $\tau/\sqrt{n}$ rows$^5$ and all $n$
columns. This sub-matrix has $\tau\sqrt{n}$ entries.

Consider the following stochastic adversary. Divide up the time period $T$ into $\tau\sqrt{n}$ intervals of
length $\frac{T}{\tau\sqrt{n}}$, indexed by the $\tau\sqrt{n}$ pairs $(i, j)$ corresponding to the entries of the sub-matrix. For every
$(i, j)$ and for every round $t$ in the interval $I_{ij}$ corresponding to $(i, j)$ we set the loss function to
be $\ell_t(W) = \sigma_t G W_{ij}$, where $\sigma_t \in \{-1, 1\}$ is a Rademacher random variable chosen independently
of all other such variables. Note that the absolute value of the derivative of the loss function is $G$.

Clearly, any algorithm for OCF has expected loss 0. Now consider the matrix $W^*$ where

$$\forall i \in [\tau/\sqrt{n}],\ j \in [n]: \quad W^*_{ij} = -\mathrm{sgn}\Big(\sum_{t \in I_{ij}} \sigma_t\Big),$$

and all entries in rows $i > \tau/\sqrt{n}$ are set to 0. Since $\mathrm{rank}(W^*) \le \tau/\sqrt{n}$, we have

$$\|W^*\|_* \le \sqrt{\mathrm{rank}(W^*)} \cdot \|W^*\|_F \le \sqrt{\frac{\tau}{\sqrt{n}}} \cdot \sqrt{\tau\sqrt{n}} = \tau,^6$$

so $W^* \in \mathcal{W}$.

$^5$ For convenience, we assume that $\frac{\tau}{\sqrt{n}}$ and $\frac{T}{\tau\sqrt{n}}$ are integers.

The expected loss of $W^*$ is

$$\sum_{i,j} \mathbb{E}\Big[\sum_{t \in I_{ij}} \sigma_t G W^*_{ij}\Big] = -G \sum_{i,j} \mathbb{E}\Big[\Big|\sum_{t \in I_{ij}} \sigma_t\Big|\Big] \le -G\tau\sqrt{n} \cdot \sqrt{\frac{T}{2\tau\sqrt{n}}} = -G\sqrt{\tfrac{1}{2}\tau\sqrt{n}\,T},$$

where the inequality above is again due to Khintchine's inequality. Hence, the expected regret of
the algorithm is at least $G\sqrt{\frac{1}{2}\tau\sqrt{n}\,T}$. In particular, there is a specific assignment of values to $\sigma_t$
such that the regret of the algorithm is at least $G\sqrt{\frac{1}{2}\tau\sqrt{n}\,T}$. □

The construction we used for deriving the above lower bound can be easily adapted to derive
a lower bound on the sample complexity of learning the class $\mathcal{W}$ in the batch setting. This is
formalized in Theorem 9, which we restate here for convenience.

Theorem 9 restated: The sample complexity of learning $\mathcal{W}$ in the batch setting is $\Omega(\tau\sqrt{n}/\varepsilon^2)$.
In particular, when $\tau = \Theta(n)$, the sample complexity is $\Omega(n^{1.5}/\varepsilon^2)$.

Proof. For simplicity, let us choose $m = n$. Let $k = \tau/\sqrt{n}$ and fix some small $\varepsilon$. Define a family
of distributions over $[n]^2 \times \{-1, 1\}$ as follows. Each distribution is parameterized by a matrix $W$
such that there is some $I \subset [n]$, with $|I| = k$, where $W(i,j) \in \{-1, 1\}$ for $i \in I$ and $W(i,j) = 0$ for
$i \notin I$. Now, the probability to sample an example $(i, j, y)$ is $(\frac{1}{2} + 2\varepsilon)\frac{1}{kn}$ if $i \in I$ and $y = W(i,j)$,
is $(\frac{1}{2} - 2\varepsilon)\frac{1}{kn}$ if $i \in I$ and $y = -W(i,j)$, and the probability is 0 in all other cases.

As in the proof of Theorem 8, any matrix defining such a distribution is in $\mathcal{W}$. Furthermore, if
we consider the absolute loss function $\ell(W, (i, j, y)) = \frac{1}{2}|W(i,j) - y|$, then the expected loss of $W$
with respect to the distribution it defines is

$$\mathbb{E}\big[\tfrac{1}{2}|W(i,j) - y|\big] = \tfrac{1}{2} - 2\varepsilon.$$

In contrast, by standard no-free-lunch arguments, no algorithm can know to predict an entry $(i, j)$
with error smaller than $\frac{1}{2} - \varepsilon$ without observing $\Omega(1/\varepsilon^2)$ examples from this entry. Therefore, no
algorithm can have an error smaller than $\frac{1}{2} - \varepsilon$ without receiving $\Omega(kn/\varepsilon^2)$ examples. □
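The arithmetic behind the expected loss $\frac{1}{2} - 2\varepsilon$ can be spelled out with exact rationals. This is a small illustrative sketch, assuming the normalization $\frac{1}{kn}$ per cell $(i, j)$ with $i \in I$ and the loss $\frac{1}{2}|W(i,j) - y|$; the function name is hypothetical.

```python
from fractions import Fraction

def expected_loss_of_W(n, k, eps):
    """Return (total probability mass, expected loss of W) for the distribution W defines."""
    p_agree = (Fraction(1, 2) + 2 * eps) / (k * n)      # y = W(i,j): loss 0
    p_disagree = (Fraction(1, 2) - 2 * eps) / (k * n)   # y = -W(i,j): loss 1
    cells = k * n                                       # entries (i, j) with i in I
    total_mass = cells * (p_agree + p_disagree)
    loss = cells * (p_agree * 0 + p_disagree * 1)
    return total_mass, loss

mass, loss = expected_loss_of_W(n=16, k=4, eps=Fraction(1, 20))
print(mass, loss)
```

The mass is 1 for any admissible $\varepsilon$, which confirms that the stated probabilities form a distribution over the $kn$ supported cells.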



$^6$ This construction is tight: e.g. if $W^*$ is formed by taking $\frac{\tau}{\sqrt{n}}$ rows of an $n \times n$ Hadamard matrix.

6 Implementation Details

In general, the update rule in Algorithm 1 is a convex optimization problem and can be computed
in polynomial time. We now give the following more efficient implementation which takes essentially
$\tilde{O}(p^3)$ time per round. This is based on the following theorem that is essentially proved
in Tsuda et al. [2006]:

Theorem 17. The optimal solution of $\arg\min_{X \in \mathcal{K}} \Delta(X, Y)$, where $Y$ is a given symmetric matrix,
and

$$\mathcal{K} := \{X \in \mathbb{S}^{n \times n} : A_j \bullet X \le b_j \text{ for } j = 1, 2, \ldots, m\},$$

is given by

$$X^* = \exp\Big(\log(Y) - \sum_{j=1}^{m} \alpha_j^* A_j'\Big),$$

where $A_j' = \frac{1}{2}(A_j + A_j^\top)$, and $\alpha^* = (\alpha_1^*, \alpha_2^*, \ldots, \alpha_m^*)$ is given by

$$\alpha^* = \arg\max_{\forall j \in [m]:\ \alpha_j \ge 0} \; -\mathrm{Tr}\Big(\exp\Big(\log(Y) - \sum_{j=1}^{m} \alpha_j A_j'\Big)\Big) - \sum_{j=1}^{m} \alpha_j b_j.$$


The idea is to avoid taking projections on the set $\mathcal{K}$ in each round. If the chosen entry in round
$t$ is $(i_t, j_t)$, then we compute $X_t$ as

$$X_t = \arg\min_{X \in \mathcal{K}_t} \Delta\big(X, \exp(\log(X_{t-1}) - \eta L_{t-1})\big),$$

where the polytope $\mathcal{K}_t$ is defined as

$$\mathcal{K}_t := \left\{ X \in \mathbb{S}^{2p \times 2p} \ \text{s.t.} \ \begin{array}{l} X(i_t, i_t) + X(j_t + q, j_t + q) + X(p + i_t, p + i_t) + X(p + j_t + q, p + j_t + q) \le 4\beta \\ X(i_t, j_t + q) - X(p + i_t, p + j_t + q) \le 1 \\ X(p + i_t, p + j_t + q) - X(i_t, j_t + q) \le 1 \\ \mathrm{Tr}(X) \le \tau \end{array} \right\}.$$

The observation is that this suffices for the regret bound of Theorem 1 to hold since the optimal
point in hindsight $X^* \in \mathcal{K}_t$ for all $t$ (see the proof of Theorem 13).

Note that $\mathcal{K}_t$ is defined using just 4 constraints, and hence the dual problem given in Theorem 17
has only 4 variables $\alpha_j$. Thus, standard convex optimization techniques (say, the ellipsoid method)
can be used to solve the dual problem to $\varepsilon$-precision in $O(\log(1/\varepsilon))$ iterations, each of which requires
computing the gradient and/or the Hessian of the objective, which can be done in $O(p^3)$ time via
the eigendecomposition, leading to an $O(p^3)$ time algorithm overall.
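To make the four constraints concrete, the following sketch checks whether a given $2p \times 2p$ symmetric matrix lies in the polytope for a round with entry $(i_t, j_t)$. It is an illustration under stated assumptions rather than part of the algorithm: indices are 0-based here (the text is 1-based), `X` is a nested list, the offsets `p` and `q` are passed in as in the text, and the function name and the tolerance parameter `tol` are hypothetical.

```python
def in_K_t(X, i_t, j_t, p, q, beta, tau, tol=1e-9):
    """Check the four linear constraints of the per-round polytope for X."""
    a, b = i_t, j_t + q              # entry (i_t, j_t) inside the first p x p block
    c, d = p + i_t, p + j_t + q      # same position inside the second p x p block
    diag_sum = X[a][a] + X[b][b] + X[c][c] + X[d][d]
    pred = X[a][b] - X[c][d]         # the two-sided constraint bounds |pred| by 1
    trace = sum(X[r][r] for r in range(2 * p))
    return (diag_sum <= 4 * beta + tol
            and pred <= 1 + tol
            and -pred <= 1 + tol
            and trace <= tau + tol)

# Example: X = 0.5 * I with p = 3, q = 1.
p, q = 3, 1
X = [[0.5 if r == c else 0.0 for c in range(2 * p)] for r in range(2 * p)]
print(in_K_t(X, 0, 1, p, q, beta=1.0, tau=4.0))
```

Only four scalar quantities of `X` enter the check, which is why the dual problem has four variables.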

More precisely, the iteration count for convex optimization methods has logarithmic dependence
on the range of the $\alpha_j$ variables. Since $\mathrm{Tr}(X_{t-1}) \le \tau$, we see (using the Golden-Thompson
inequality [Golden, 1965, Thompson, 1965]) that

$$\mathrm{Tr}\big(\exp(\log(X_{t-1}) - \eta L_{t-1})\big) \le X_{t-1} \bullet \exp(-\eta L_{t-1}) \le 3\tau.$$

Thus, setting all $\alpha_j = 0$, the dual objective value is at least $-3\tau$. Since $b_j \ge 1$ for all $j$, we get
that the optimal values of $\alpha_j$ are all bounded by $3\tau$. Thus, the range of all $\alpha_j$ can be set to $[0, 3\tau]$,
giving an $O(\log(\frac{\tau}{\varepsilon}))$ bound on the number of iterations.

7 Conclusions



In recent years the FTRL (Follow The Regularized Leader) paradigm has become the method of 
choice for proving regret bounds for online learning problems. In several online learning problems 
a direct application of this paradigm has failed to give tight regret bounds due to suboptimal 
"convexification" of the problem. This unsatisfying situation occurred in mainstream applications, 
such as online collaborative filtering, but also in basic prediction settings such as the online max 
cut or online gambling settings. 

In this paper we single out a common property of these unresolved problems: they involve struc- 
tured matrix prediction, in the sense that the matrices involved have certain nice decompositions. 
We give a unified formulation for three of these structured matrix prediction problems which leads 
to near-optimal convexification. Applying the standard FTRL algorithm, Matrix Multiplicative 
Weights, now gives efficient and near optimal regret algorithms for these problems. In the process 
we resolve two COLT open problems. The main conclusion of this paper is that spectral analysis
in matrix prediction tasks can be surprisingly powerful, even when the connection between
the spectrum and the problem may not be obvious at first sight (such as in the online gambling
problem).

We leave open the question of bridging the logarithmic gap between known upper and lower
bounds for regret in these structured prediction problems. Note that since all three decompositions
in this paper are optimal up to constant factors, one cannot close the gap by improving
the decomposition; some fundamentally different algorithm seems necessary. It would also be
interesting to see more applications of the $(\beta, \tau)$-decomposition for other online matrix prediction
problems.

A Matrix Multiplicative Weights Algorithm

For the sake of completeness, we prove Theorem 13. The setting is as follows. We have an
online convex optimization problem where the decision set is a convex subset $\mathcal{K}$ of $N \times N$ positive
semidefinite matrices of trace bounded by $\tau$, viz. for all $X \in \mathcal{K}$, we have $X \succeq 0$ and $\mathrm{Tr}(X) \le \tau$.
We assume for convenience that $\frac{\tau}{N} I \in \mathcal{K}$. In each round $t$, the learner produces a matrix $X_t \in \mathcal{K}$,
and the adversary supplies a loss matrix $L_t \in \mathbb{R}^{N \times N}$, which is assumed to be symmetric. The loss
of the learner is $X_t \bullet L_t$. The goal is to minimize regret defined as

$$\mathrm{Regret} := \sum_{t=1}^{T} X_t \bullet L_t - \min_{X \in \mathcal{K}} \sum_{t=1}^{T} X \bullet L_t.$$

Consider Algorithm 1. We now prove Theorem 13, which we restate here for convenience:

Theorem 18. Suppose $\eta$ is chosen so that $\eta\|L_t\| \le 1$ for all $t$. Then

$$\mathrm{Regret} \le \eta \sum_{t=1}^{T} X_t \bullet L_t^2 + \frac{\tau \log(N)}{\eta}.$$

Proof. Consider any round $t$. Let $X \in \mathcal{K}$ be any matrix. We use the quantum relative entropy,
$\Delta(X, X_t)$, as a potential function. We have

$$\Delta(X, \exp(\log(X_t) - \eta L_t)) - \Delta(X, X_t) = \eta X \bullet L_t - \mathrm{Tr}(X_t) + \mathrm{Tr}(\exp(\log(X_t) - \eta L_t)). \quad (5)$$

Now quantum relative entropy projection onto the set $\mathcal{K}$ is a Bregman projection, and hence the
Generalized Pythagorean inequality applies (see Tsuda et al. [2006]):

$$\Delta(X, X_{t+1}) + \Delta(X_{t+1}, \exp(\log(X_t) - \eta L_t)) \le \Delta(X, \exp(\log(X_t) - \eta L_t)),$$

and since $\Delta(X_{t+1}, \exp(\log(X_t) - \eta L_t)) \ge 0$, we get that

$$\Delta(X, X_{t+1}) \le \Delta(X, \exp(\log(X_t) - \eta L_t)).$$

Hence from (5) we get

$$\Delta(X, X_{t+1}) - \Delta(X, X_t) \le \eta X \bullet L_t - \mathrm{Tr}(X_t) + \mathrm{Tr}(\exp(\log(X_t) - \eta L_t)). \quad (6)$$

Now, using the Golden-Thompson inequality [Golden, 1965, Thompson, 1965], we have

$$\mathrm{Tr}(\exp(\log(X_t) - \eta L_t)) \le \mathrm{Tr}(X_t \exp(-\eta L_t)).$$

Next, using the fact that $\exp(A) \preceq I + A + A^2$ for $\|A\| \le 1$,$^7$ we obtain

$$\mathrm{Tr}(X_t \exp(-\eta L_t)) \le \mathrm{Tr}(X_t (I - \eta L_t + \eta^2 L_t^2)).$$

Combining the above and plugging into (6) we get

$$\Delta(X, X_{t+1}) - \Delta(X, X_t) \le \eta X \bullet L_t - \eta X_t \bullet L_t + \eta^2 X_t \bullet L_t^2. \quad (7)$$

Summing up from $t = 1$ to $T$, and rearranging, we get

$$\mathrm{Regret} \le \eta \sum_{t=1}^{T} X_t \bullet L_t^2 + \frac{\Delta(X, X_1) - \Delta(X, X_{T+1})}{\eta} \le \eta \sum_{t=1}^{T} X_t \bullet L_t^2 + \frac{\tau \log(N)}{\eta},$$

since $\Delta(X, X_{T+1}) \ge 0$ and

$$\begin{aligned} \Delta(X, X_1) &= X \bullet (\log(X) - \log(\tfrac{\tau}{N} I)) - \mathrm{Tr}(X) + \tau \\ &= X \bullet \log(\tfrac{1}{\tau} X) + \log(\tau)\mathrm{Tr}(X) - \log(\tfrac{\tau}{N})\mathrm{Tr}(X) - \mathrm{Tr}(X) + \tau \\ &\le \mathrm{Tr}(X)(\log(N) - 1) + \tau \\ &\le \tau \log(N). \end{aligned}$$

The first inequality above follows because $\mathrm{Tr}(X) \le \tau$, so $\log(\frac{1}{\tau} X) \preceq 0$. The second inequality uses
$\mathrm{Tr}(X) \le \tau$.

□
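The bound of Theorem 18 can be observed in the simplest special case, where all matrices are diagonal and the matrix exponential and logarithm act entrywise. The sketch below is an illustration under that assumption, not the general algorithm: the decision set is taken to be diagonal positive semidefinite matrices with trace exactly $\tau$ (so the relative-entropy projection is a rescaling), and the function name and parameters are hypothetical.

```python
import math
import random

def diagonal_mmw(losses, tau, eta):
    """Multiplicative update restricted to diagonal matrices (stored as vectors).

    losses: list of length-N lists holding the diagonals of the loss matrices L_t.
    Returns (regret, bound), where bound = eta * sum_t X_t . L_t^2 + tau * log(N) / eta.
    """
    N = len(losses[0])
    x = [tau / N] * N                 # X_1 = (tau / N) I
    learner_loss = 0.0
    second_order = 0.0                # running sum of X_t . L_t^2
    cumulative = [0.0] * N
    for L in losses:
        learner_loss += sum(xi * li for xi, li in zip(x, L))
        second_order += sum(xi * li * li for xi, li in zip(x, L))
        cumulative = [c + li for c, li in zip(cumulative, L)]
        # exp(log(X_t) - eta * L_t), computed entrywise on the diagonal.
        y = [xi * math.exp(-eta * li) for xi, li in zip(x, L)]
        # Projection onto {trace = tau} under relative entropy is a rescaling.
        s = sum(y)
        x = [tau * yi / s for yi in y]
    best = tau * min(cumulative)      # best fixed diagonal matrix of trace tau
    regret = learner_loss - best
    bound = eta * second_order + tau * math.log(N) / eta
    return regret, bound

random.seed(0)
losses = [[random.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(200)]
print(diagonal_mmw(losses, tau=2.0, eta=0.1))
```

With losses in $[-1, 1]$ and $\eta = 0.1$ the condition $\eta\|L_t\| \le 1$ holds, so the returned regret should not exceed the returned bound.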

B Technical Lemmas and Proofs 



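Lemma 19. For $m \times n$ non-symmetric matrices $W$, if $W = U\Sigma V^\top$ is the singular value decomposition
of $W$, then

$$\mathrm{sym}(W) = \begin{bmatrix} \frac{1}{\sqrt{2}} U & \frac{1}{\sqrt{2}} U \\ \frac{1}{\sqrt{2}} V & -\frac{1}{\sqrt{2}} V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} U^\top & \frac{1}{\sqrt{2}} V^\top \\ \frac{1}{\sqrt{2}} U^\top & -\frac{1}{\sqrt{2}} V^\top \end{bmatrix}$$

is the eigenvalue decomposition of $\mathrm{sym}(W)$. In particular, $\|\mathrm{sym}(W)\|_* = 2\|W\|_*$.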



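$^7$ To see this, note that we can write $A = VDV^\top$ for some orthonormal $V$ and diagonal $D$. Therefore,

$$I + A + A^2 - e^A = V\big(I + D + D^2 - e^D\big)V^\top.$$

Now, by the inequality $1 + a + a^2 - e^a \ge 0$, which holds for all $a \le 1$, we obtain that all elements of the diagonal
matrix $(I + D + D^2 - e^D)$ are non-negative.

Proof. By the block matrix multiplication rule we have

$$\begin{bmatrix} \frac{1}{\sqrt{2}} U & \frac{1}{\sqrt{2}} U \\ \frac{1}{\sqrt{2}} V & -\frac{1}{\sqrt{2}} V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} U^\top & \frac{1}{\sqrt{2}} V^\top \\ \frac{1}{\sqrt{2}} U^\top & -\frac{1}{\sqrt{2}} V^\top \end{bmatrix} = \begin{bmatrix} \frac{1}{\sqrt{2}} U\Sigma & -\frac{1}{\sqrt{2}} U\Sigma \\ \frac{1}{\sqrt{2}} V\Sigma & \frac{1}{\sqrt{2}} V\Sigma \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} U^\top & \frac{1}{\sqrt{2}} V^\top \\ \frac{1}{\sqrt{2}} U^\top & -\frac{1}{\sqrt{2}} V^\top \end{bmatrix} = \begin{bmatrix} 0 & U\Sigma V^\top \\ V\Sigma U^\top & 0 \end{bmatrix} = \begin{bmatrix} 0 & W \\ W^\top & 0 \end{bmatrix}.$$

In addition, it is easy to check that the columns of

$$\begin{bmatrix} \frac{1}{\sqrt{2}} U & \frac{1}{\sqrt{2}} U \\ \frac{1}{\sqrt{2}} V & -\frac{1}{\sqrt{2}} V \end{bmatrix}$$

are orthonormal. It follows that the above form is the eigendecomposition of $\mathrm{sym}(W)$, whose nonzero
eigenvalues are therefore $\pm\sigma_i$ for the singular values $\sigma_i$ of $W$. Therefore, for the trace norm:
$\|\mathrm{sym}(W)\|_* = 2\|\Sigma\|_* = 2\|W\|_*$, which concludes our proof. □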
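As a quick illustration of Lemma 19, consider the smallest case, a $1 \times 1$ matrix $W = [\sigma]$ with $\sigma > 0$, so that $U = V = [1]$ and $\Sigma = [\sigma]$. This worked example is added for intuition only:

```latex
\mathrm{sym}(W)
= \begin{bmatrix} 0 & \sigma \\ \sigma & 0 \end{bmatrix}
= \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}
  \begin{bmatrix} \sigma & 0 \\ 0 & -\sigma \end{bmatrix}
  \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},
\qquad
\|\mathrm{sym}(W)\|_* = |\sigma| + |-\sigma| = 2\sigma = 2\|W\|_* .
```

The eigenvalues $\pm\sigma$ come in pairs, which is exactly why the trace norm doubles.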



C The optimal cut in the Online Max Cut problem 

We prove Lemma 2, which we restate here for convenience.

Lemma 2 restated: Consider an online sequence of loss functions $\{\ell_t = \frac{1}{2}|y_t - \hat{y}_t|\}$. Let

$$W^* = \arg\min_{W \in \mathcal{W}} \sum_{t} \ell_t(W(i_t, j_t)).$$

Then $W^* = W_A$ for the set $A$ that determines the max cut in the weighted graph over $[n]$
nodes whose weights are given by $w_{ij} = \sum_{t : (i_t, j_t) = (i, j)} y_t$ for every $(i, j)$.

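Proof. Consider $W_A$. For each pair $(i, j)$ let $c^+_{ij}, c^-_{ij}$ be the total number of iterations in which
the pair $(i, j)$ appeared in the adversarial sequence with $y_t = 1$ or $y_t = -1$ respectively. Since
$y_t \in \{-1, 1\}$ we can rewrite the total loss as:

$$\sum_t \ell_t(W_A(i_t, j_t)) = \sum_{(i,j)} \Big( c^+_{ij} \cdot \tfrac{1}{2}\big(1 - W_A(i,j)\big) + c^-_{ij} \cdot \tfrac{1}{2}\big(1 + W_A(i,j)\big) \Big) = C_T - \frac{1}{2} \sum_{(i,j)} W_A(i,j)\, w_{ij},$$

where $C_T$ is a constant which is independent of $W_A$ and $w_{ij} = c^+_{ij} - c^-_{ij}$. Hence, minimizing the above
expression is equivalent to maximizing the expression:

$$\sum_{(i,j)} W_A(i,j)\, w_{ij} = 2 \sum_{(i,j):\ W_A(i,j) = 1} w_{ij} - \sum_{i,j} w_{ij}.$$

Since $\sum_{i,j} w_{ij}$ is a constant independent of $A$, the cut which maximizes this expression is the
maximum cut in the weighted graph over the weights $w_{ij}$. □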
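The equivalence in Lemma 2 can be verified by brute force on a toy instance. The sketch below is illustrative: it assumes the convention $W_A(i,j) = 1$ when the cut $A$ separates $i$ and $j$ and $-1$ otherwise, labels $y_t \in \{-1, 1\}$, and a made-up sequence of rounds; all names are hypothetical.

```python
from itertools import combinations

def cut_matrix_entry(A, i, j):
    """W_A(i, j): 1 if the cut A separates i and j, and -1 otherwise."""
    return 1 if (i in A) != (j in A) else -1

def total_loss(A, rounds):
    """Sum of (1/2)|y_t - W_A(i_t, j_t)| over rounds given as (i_t, j_t, y_t)."""
    return sum(abs(y - cut_matrix_entry(A, i, j)) / 2 for i, j, y in rounds)

def cut_weight(A, rounds):
    """Weight of the cut A when w_ij is the sum of y_t over rounds that hit (i, j)."""
    return sum(y for i, j, y in rounds if (i in A) != (j in A))

def all_subsets(n):
    """Enumerate every subset of {0, ..., n-1} as a frozenset."""
    for r in range(n + 1):
        for A in combinations(range(n), r):
            yield frozenset(A)

rounds = [(0, 1, 1), (1, 2, 1), (0, 2, -1), (0, 1, 1), (2, 3, -1), (1, 3, 1)]
subsets = list(all_subsets(4))
best = min(subsets, key=lambda A: total_loss(A, rounds))
print(sorted(best), total_loss(best, rounds), cut_weight(best, rounds))
```

On any such instance, `total_loss(A, rounds) + cut_weight(A, rounds)` is the same for every subset `A`, so the loss minimizer and the maximum cut coincide.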
D Optimality of Decomposition for Collaborative Filtering

In this section, we prove the following theorem:

Theorem 20. Consider the matrix $W$ formed by taking $m = \frac{\tau}{\sqrt{n}}$ rows of an $n \times n$ Hadamard
matrix. This matrix has $\|W\|_* = \tau$, and any $(\beta, \tilde{\tau})$-decomposition for $\mathrm{sym}(W)$ has

$$\beta\tilde{\tau} \ge \frac{1}{4}\tau\sqrt{n}.$$

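Proof. Since the rows of $W$ are orthogonal to each other, the $m$ singular values of $W$ all equal $\sqrt{n}$,
and thus $\|W\|_* = m\sqrt{n} = \tau$. Further, the SVD of $W$ is (here, $I_m$ is the $m \times m$ identity matrix):

$$W = I_m (\sqrt{n}\, I_m)\big(\tfrac{1}{\sqrt{n}} W\big).$$

Using Lemma 19 the eigendecomposition of $\mathrm{sym}(W)$ can be written as

$$\mathrm{sym}(W) = U(\sqrt{n}\, I_m)U^\top + V(-\sqrt{n}\, I_m)V^\top,$$

where

$$U = \big[\tfrac{1}{\sqrt{2}} I_m,\ \tfrac{1}{\sqrt{2n}} W\big]^\top \quad \text{and} \quad V = \big[\tfrac{1}{\sqrt{2}} I_m,\ -\tfrac{1}{\sqrt{2n}} W\big]^\top$$

are $p \times m$ matrices with orthonormal columns (with $p = m + n$).

Let $\mathrm{sym}(W) = P - N$ be a $(\beta, \tilde{\tau})$-decomposition. Now consider the following matrices: first,
define the $p \times p$ diagonal matrix

$$D = \begin{bmatrix} \frac{1}{\sqrt{2m}} I_m & 0 \\ 0 & \frac{\sqrt{mn}}{2\sqrt{2}\,\tilde{\tau}} I_n \end{bmatrix}.$$

Finally, define the $p \times p$ positive semidefinite matrix

$$Y := DUU^\top D.$$

Since $U$ has orthonormal columns we have $UU^\top \preceq I_p$, and so

$$Y \preceq D I_p D = D^2.$$

Now, consider

$$\begin{aligned} Y \bullet \mathrm{sym}(W) &= Y \bullet (P - N) \\ &\le Y \bullet P && (\text{since } Y, N \succeq 0, \text{ so } Y \bullet N \ge 0) \\ &\le D^2 \bullet P && (\text{since } Y \preceq D^2) \\ &= \frac{1}{2m} \sum_{i=1}^{m} P(i,i) + \frac{mn}{8\tilde{\tau}^2} \sum_{i=m+1}^{p} P(i,i) \\ &\le \frac{1}{2}\beta + \frac{mn}{8\tilde{\tau}}, \end{aligned}$$

since $P(i,i) \le \beta$ for all $i$ and $\mathrm{Tr}(P) \le \tilde{\tau}$. We also have

$$\begin{aligned} Y \bullet \mathrm{sym}(W) &= \mathrm{Tr}(DUU^\top D\, \mathrm{sym}(W)) \\ &= \mathrm{Tr}(UU^\top D\, \mathrm{sym}(W) D) \\ &= \frac{\sqrt{n}}{4\tilde{\tau}} \mathrm{Tr}(UU^\top \mathrm{sym}(W)) && (\text{since } D\, \mathrm{sym}(W) D = \mathrm{sym}(\tfrac{\sqrt{n}}{4\tilde{\tau}} W)) \\ &= \frac{\sqrt{n}}{4\tilde{\tau}} \mathrm{Tr}\big(UU^\top [U(\sqrt{n}\, I_m)U^\top + V(-\sqrt{n}\, I_m)V^\top]\big) \\ &= \frac{mn}{4\tilde{\tau}}, \end{aligned}$$

since $U^\top V = 0$. Putting the above two inequalities together, we have

$$\frac{mn}{4\tilde{\tau}} \le \frac{1}{2}\beta + \frac{mn}{8\tilde{\tau}},$$

which implies that

$$\beta\tilde{\tau} \ge \frac{1}{4} mn = \frac{1}{4}\tau\sqrt{n},$$

as required.

□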
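The Hadamard instance used in Theorem 20 is easy to build explicitly. The sketch below is illustrative and assumes $n$ is a power of 2 so that the Sylvester construction applies; the helper names are hypothetical. It confirms that the chosen rows are orthogonal with squared norm $n$, so all $m$ singular values equal $\sqrt{n}$, and then evaluates the lower bound $\frac{1}{4}\tau\sqrt{n}$.

```python
import math

def sylvester_hadamard(n):
    """n x n Hadamard matrix for n a power of 2 (Sylvester construction)."""
    H = [[1]]
    while len(H) < n:
        H = [row + row for row in H] + [row + [-x for x in row] for row in H]
    return H

def gram(W):
    """W W^T as a list of lists."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in W] for r1 in W]

n, m = 16, 4
W = sylvester_hadamard(n)[:m]        # m rows of the Hadamard matrix
G = gram(W)                          # equals n * I_m when the rows are orthogonal
tau = m * math.sqrt(n)               # trace norm of W: m singular values equal to sqrt(n)
lower_bound = tau * math.sqrt(n) / 4 # Theorem 20: beta * tau_tilde is at least this
print(tau, lower_bound)
```

Here $m = \tau/\sqrt{n}$ by construction, matching the assumption in the theorem.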



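E Relation between $(\beta, \tau)$-decomposition, max-norm and trace-norm

In this section, we consider an $m \times n$ non-symmetric matrix $W$. The max-norm of $W$ is defined to be
(see Lee et al. [2010]) the value of the following SDP:

$$\begin{aligned} \min \ & t \\ \text{s.t.} \ & \begin{bmatrix} Y_1 & W \\ W^\top & Y_2 \end{bmatrix} \succeq 0 \\ & \forall i \in [m],\ j \in [n]: \ Y_1(i,i),\ Y_2(j,j) \le t. \end{aligned} \quad (8)$$

The least possible $\beta$ in any $(\beta, \tau)$-decomposition for $W$ is given by the following SDP:

$$\begin{aligned} \min \ & \beta \\ \text{s.t.} \ & \begin{bmatrix} 0 & W \\ W^\top & 0 \end{bmatrix} = P - N \\ & P, N \succeq 0 \\ & \forall i \in [m + n]: \ P(i,i),\ N(i,i) \le \beta. \end{aligned} \quad (9)$$
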



Theorem 21. The least possible $\beta$ in any $(\beta, \tau)$-decomposition exactly equals half the max-norm
of $W$.

Proof. Let $t^*$ and $\beta^*$ be the optima of SDPs (8) and (9) respectively. Let $Y_1, Y_2$ be the optimal
solution to SDP (8), so that for all $i \in [m]$, $j \in [n]$ we have $Y_1(i,i), Y_2(j,j) \le t^*$. Consider the
matrices

$$P = \frac{1}{2} \begin{bmatrix} Y_1 & W \\ W^\top & Y_2 \end{bmatrix} \quad \text{and} \quad N = \frac{1}{2} \begin{bmatrix} Y_1 & -W \\ -W^\top & Y_2 \end{bmatrix}.$$

Using the feasibility of $Y_1, Y_2$ and Lemma 22 we get that $P, N \succeq 0$. Thus this is a feasible
solution to SDP (9). Hence, we conclude that $t^* \ge 2\beta^*$.

Now let $P, N$ be the optimal solution to SDP (9), so that for all $i \in [m + n]$ we have
$P(i,i), N(i,i) \le \beta^*$. Consider the blocks of $P$ and $N$ formed by the first $m$ indices and the
last $n$ indices:

$$P = \begin{bmatrix} P^A & P^B \\ P^C & P^D \end{bmatrix} \quad \text{and} \quad N = \begin{bmatrix} N^A & N^B \\ N^C & N^D \end{bmatrix}.$$

Since $N \succeq 0$, by Lemma 22 the following matrix is positive semidefinite as well:

$$N' = \begin{bmatrix} N^A & -N^B \\ -N^C & N^D \end{bmatrix} \succeq 0.$$

So $P + N' \succeq 0$, i.e.

$$P + N' = \begin{bmatrix} P^A + N^A & W \\ W^\top & P^D + N^D \end{bmatrix} \succeq 0.$$

Thus, $Y_1 = P^A + N^A$ and $Y_2 = P^D + N^D$ is a feasible solution to SDP (8). Now for all $i \in [m]$
we have $Y_1(i,i) \le P^A(i,i) + N^A(i,i) \le 2\beta^*$, and similarly for all $j \in [n]$ we have $Y_2(j,j) \le 2\beta^*$.
Thus, we conclude that $t^* \le 2\beta^*$. □

Lemma 22. Let $P$ be a positive semidefinite matrix of order $m + n$ and let

$$P = \begin{bmatrix} P^A & P^B \\ P^C & P^D \end{bmatrix}$$

be the block decomposition of $P$ formed by the first $m$ indices and the last $n$ indices. Then the
following matrix is positive semidefinite:

$$P' = \begin{bmatrix} P^A & -P^B \\ -P^C & P^D \end{bmatrix}.$$


Proof. Since $P \succeq 0$, there are vectors $v_i$, for $i \in [m + n]$, such that $P(i,j) = v_i \cdot v_j$ for all $i, j \in [m + n]$. Then
consider the vectors

$$w_i = \begin{cases} v_i & \text{if } i \in [m] \\ -v_i & \text{otherwise.} \end{cases}$$

It is easy to check that for all $i, j \in [m + n]$ we have $P'(i,j) = w_i \cdot w_j$. Thus, we conclude that
$P' \succeq 0$. □
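The sign-flip argument of Lemma 22 can be seen on a random Gram matrix. The sketch below is illustrative only: the vectors are random, the dimensions are arbitrary, and the helper names are hypothetical. It builds $P$ as a Gram matrix, negates its off-diagonal blocks, and compares the result with the Gram matrix of the vectors whose last $n$ members have been negated.

```python
import random

def gram(vectors):
    """Gram matrix of a list of vectors (entry (i, j) is the inner product)."""
    return [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]

def flip_off_diagonal_blocks(P, m):
    """Negate the two off-diagonal blocks of P (first m indices vs. the rest)."""
    size = len(P)
    return [[P[i][j] if (i < m) == (j < m) else -P[i][j] for j in range(size)]
            for i in range(size)]

random.seed(1)
m, n, dim = 2, 3, 4
v = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(m + n)]
w = [vi if i < m else [-x for x in vi] for i, vi in enumerate(v)]   # w_i = +/- v_i
P = gram(v)
P_prime = flip_off_diagonal_blocks(P, m)
```

Since `P_prime` agrees with `gram(w)`, it is itself a Gram matrix and hence positive semidefinite.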

Finally, we show the connection between the trace-norm and the least possible $\tau$ in any $(\beta, \tau)$-decomposition:

Theorem 23. The least possible $\tau$ in any $(\beta, \tau)$-decomposition exactly equals twice the trace-norm
of $W$.

Proof. Let $\tau^*$ be the least possible value of $\tau$ in any $(\beta, \tau)$-decomposition, and let $P, N$ be positive
semidefinite matrices such that $\mathrm{sym}(W) = P - N$ and $\mathrm{Tr}(P) + \mathrm{Tr}(N) = \tau^*$. Then by the triangle
inequality, we have

$$\|\mathrm{sym}(W)\|_* \le \|P\|_* + \|N\|_*.$$

Since $\|\mathrm{sym}(W)\|_* = 2\|W\|_*$, $\|P\|_* = \mathrm{Tr}(P)$, and $\|N\|_* = \mathrm{Tr}(N)$, we conclude that $\tau^* \ge 2\|W\|_*$.
Now, let

$$\mathrm{sym}(W) = \sum_i \lambda_i v_i v_i^\top$$

be the eigenvalue decomposition of $\mathrm{sym}(W)$. Now consider the positive semidefinite matrices

$$P = \sum_{i:\ \lambda_i \ge 0} \lambda_i v_i v_i^\top \quad \text{and} \quad N = \sum_{i:\ \lambda_i < 0} -\lambda_i v_i v_i^\top.$$

Clearly $\mathrm{sym}(W) = P - N$, and

$$\mathrm{Tr}(P) + \mathrm{Tr}(N) = \sum_i |\lambda_i| = \|\mathrm{sym}(W)\|_* = 2\|W\|_*.$$

Hence, $\tau^* \le 2\|W\|_*$, completing the proof. □